Fix/coder32b fence extraction #14
Open
wlu03 wants to merge 12 commits into
36 per-pattern HO-* checkers (held_out.py) close the fallback gap so held-out rows earn real verdicts (16.5% FAITHFUL). report_2x2 consumes faithfulness_cell and reports four cells x fast/slow. Overall FAITHFUL 26.8% -> 29.1%.
scripts/category_difficulty.py refutes the IS-hardest/AL-SR-easiest priors: DS hardest by pass@1 (47.9%, bottom-2 for 14/15 models), MI easiest (81.3%); IS is hardest only to speed up (1.24x geomean). README finding added.
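The 1.24x figure above is a geometric mean, which suits ratios because a 2x speedup and a 2x slowdown cancel out. A minimal sketch of that aggregation follows; the function name and the `(baseline_seconds, optimized_seconds)` input shape are assumptions for illustration, not the layout used by `scripts/category_difficulty.py`.

```python
import statistics


def geomean_speedup(timings):
    """Geometric mean of baseline/optimized time ratios.

    timings: iterable of (baseline_seconds, optimized_seconds) pairs
    (hypothetical input shape, chosen for this example).
    """
    ratios = [baseline / optimized for baseline, optimized in timings]
    return statistics.geometric_mean(ratios)
```

With one 2x speedup and one 2x slowdown, the geometric mean is 1.0 (no net change), whereas the arithmetic mean of the same ratios would be 1.25.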
scripts/cross_pattern_transfer.py: per-category pass@1 correlates only moderately across 15 models (mean Spearman +0.50). Clusters AL-CF +0.77, DS-IS +0.70; MI most independent; AL best predictor of overall skill (+0.80).
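For readers unfamiliar with the statistic, the mean pairwise Spearman correlation could be computed roughly as below. This is a dependency-free sketch, not the code in `scripts/cross_pattern_transfer.py`; the `{category: [pass@1 per model]}` input shape and all function names are assumptions.

```python
from itertools import combinations


def average_ranks(values):
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def mean_pairwise_spearman(pass_at_1):
    """pass_at_1: {category: [pass@1 per model, same model order everywhere]}."""
    pairs = list(combinations(sorted(pass_at_1), 2))
    return sum(spearman(pass_at_1[a], pass_at_1[b]) for a, b in pairs) / len(pairs)
```

Each list holds one value per model, so a high correlation between two categories means the models are ordered similarly in both.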
modal_app/finetune_weak3.py trains QLoRA on r1-distill-qwen-7b, yi-coder-9b, opencoder-8b (held-out excluded), merges to 16-bit, stages on the pdob-finetuned volume. inference.py registers *-ft model keys from that volume so eval is the unchanged pipeline.
Swap targets to the weakest fine-tune-friendly models (rescue experiment): r1-distill-qwen-1.5b (2.8%), r1-distill-qwen-7b (26.7%), qwen2.5-coder-1.5b (59.4%, non-reasoning control). inference.py *-ft keys synced.
Eval the 3 fine-tuned weak models on the 178 unseen held-out variants, paired Wilcoxon vs base. Result: no positive transfer — non-reasoning qwen2.5-coder-1.5b regresses significantly (held-out pass@1 -39/-50pp, p=0.001; hallucinated externs, catastrophic forgetting); reasoning models nudge up off ~0 baselines but not significantly. README finding #7.
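As a reference for the paired comparison, here is a sketch of a two-sided Wilcoxon signed-rank test using the normal approximation with tie correction. It assumes the pairing is per held-out variant (base score vs fine-tuned score), which the commit message does not state explicitly; in practice a library routine such as `scipy.stats.wilcoxon` would be the usual choice, and the function names here are illustrative.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def wilcoxon_signed_rank(base, tuned):
    """Return (W_plus, two-sided p) for paired scores; zero differences dropped."""
    diffs = [t - b for b, t in zip(base, tuned) if t != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    # Tie correction: each group of t tied ranks reduces the variance.
    tie_sizes = {}
    for r in ranks:
        tie_sizes[r] = tie_sizes.get(r, 0) + 1
    var = n * (n + 1) * (2 * n + 1) / 24
    var -= sum(t ** 3 - t for t in tie_sizes.values()) / 48
    if var <= 0:
        return w_plus, 1.0
    z = (w_plus - mean) / math.sqrt(var)
    return w_plus, math.erfc(abs(z) / math.sqrt(2))
```

The normal approximation is only reasonable for larger samples; for a handful of pairs an exact test is preferable.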
modal_app/finetune_sweep.py: grid over epochs/lr/LoRA-rank/dropout + completion-only loss (Unsloth train_on_responses_only) + replay data (CodeAlpaca-20k mix) to fight the phase-1 overfitting. 2 subjects (qwen2.5-coder-1.5b regressor, r1-distill-7b) x 7 configs; inference.py registers the *-ft variants for held-out eval.
prepare_indist_split.py holds out whole base-pattern variants (79) for a clean
in-distribution test (the old random split leaked 255/273 variants). finetune_indist.py
sweeps epochs {1,3,6,10} on the clean split to map the in-dist-transfer vs OOD-forgetting
crossover (researched recipe: lr 2e-4, alpha=2r, dropout 0.1, completion-only).
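The leak described above is what a row-level random split produces when many rows share a base pattern. A minimal sketch of a group-wise split follows; the `base_pattern` field name and the function signature are assumptions for illustration, not the interface of `prepare_indist_split.py`.

```python
import random
from collections import defaultdict


def split_by_base_pattern(rows, n_test_patterns, seed=0):
    """Hold out whole base patterns so no pattern appears in both splits.

    rows: list of dicts with a 'base_pattern' key (hypothetical field name).
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row["base_pattern"]].append(row)

    patterns = sorted(groups)                # deterministic order before shuffling
    random.Random(seed).shuffle(patterns)
    test_patterns = set(patterns[:n_test_patterns])

    train = [r for p in patterns if p not in test_patterns for r in groups[p]]
    test = [r for p in patterns if p in test_patterns for r in groups[p]]
    return train, test
```

Because the split is drawn over patterns rather than rows, every variant of a held-out pattern lands in the test set together.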
evaluate_all_modal spawns generation+CSV-write on Modal (survives --detach disconnect, unlike evaluate_all's .map). score_modal.py scores cells on Modal CPU; compiler.py honors PDOB_*_TIMEOUT env so broken candidates die fast.
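An environment-driven timeout of the kind described can be sketched as follows. The helper names and the example variable `PDOB_COMPILE_TIMEOUT` are hypothetical instances of the `PDOB_*_TIMEOUT` pattern; the actual variable names and defaults in `compiler.py` are not shown in this PR text.

```python
import os
import subprocess


def timeout_from_env(name, default):
    """Read a positive timeout in seconds from the environment, else use default."""
    raw = os.environ.get(name, "").strip()
    if not raw:
        return default
    try:
        value = float(raw)
    except ValueError:
        return default
    return value if value > 0 else default


def run_with_timeout(cmd, env_name, default):
    """Run cmd; return the CompletedProcess, or None if it exceeded the timeout."""
    try:
        return subprocess.run(
            cmd,
            capture_output=True,
            text=True,
            timeout=timeout_from_env(env_name, default),
        )
    except subprocess.TimeoutExpired:
        return None
```

Falling back to the default on a malformed value keeps a typo in the environment from disabling the timeout entirely.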
Interrupted merges left config.json + tokenizer but no safetensors, which the idempotency check treated as 'already merged' (so they were skipped) and vLLM then couldn't load. Now check for safetensors and wipe+retrain partials. Add crossover_tick.sh to idempotently drive the epoch-sweep eval->score->crossover.
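The stricter idempotency check could look roughly like this. It is a sketch under assumptions: the function names are hypothetical, and "complete" is taken to mean `config.json` plus at least one non-empty `*.safetensors` file, which may be looser than the real check.

```python
import shutil
from pathlib import Path


def is_merge_complete(model_dir):
    """A merge counts as complete only if weights exist, not just config/tokenizer."""
    d = Path(model_dir)
    if not d.is_dir():
        return False
    has_config = (d / "config.json").is_file()
    has_weights = any(p.stat().st_size > 0 for p in d.glob("*.safetensors"))
    return has_config and has_weights


def needs_retrain(model_dir):
    """Return True if the merge must be (re)run; wipe any partial output first."""
    d = Path(model_dir)
    if is_merge_complete(d):
        return False
    if d.exists():
        shutil.rmtree(d)  # partial merge: remove so the rerun starts clean
    return True
```

Checking for the weight files rather than the directory or `config.json` is what prevents an interrupted merge from being skipped as "already merged".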
The orchestrator checkpoints incrementally, so a still-generating eval CSV looked ready and got scored on a partial (26/257 rows). Only score when all 257 in-dist+OOD rows are present; only mark DONE when the scored CSV is complete.
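A completeness gate along these lines could be sketched as below. The function names are hypothetical, and the sketch assumes each CSV has a single header row followed by one data row per evaluated item; the 257 figure comes from the commit message.

```python
import csv
from pathlib import Path

EXPECTED_ROWS = 257  # in-dist + OOD rows per cell, per the commit message


def count_data_rows(csv_path):
    """Count non-empty data rows, skipping the header (assumed to be one line)."""
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        next(reader, None)
        return sum(1 for row in reader if row)


def has_all_rows(csv_path, expected=EXPECTED_ROWS):
    """True only when the file exists and holds every expected row."""
    p = Path(csv_path)
    return p.is_file() and count_data_rows(p) >= expected


def next_action(eval_csv, scored_csv):
    """Decide what a polling tick should do for one cell."""
    if has_all_rows(scored_csv):
        return "done"
    if has_all_rows(eval_csv):
        return "score"
    return "wait"  # eval is still checkpointing rows
```

Using `csv.reader` rather than counting lines avoids miscounting rows whose quoted fields contain newlines, which generated code often does.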
A prematurely-scored cell (e.g. 36 rows from a partial eval) polluted the table with tiny-denominator garbage. Require ~257 rows or mark the cell incomplete.
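On the reporting side, the guard might be as small as the following sketch. The function name is hypothetical and the row threshold is left as a parameter, since the exact tolerance behind "~257" is not stated here.

```python
def format_cell(n_pass, n_rows, min_rows):
    """Render a pass rate, or flag the cell when too few rows were scored.

    min_rows is a caller-supplied threshold (the source only says "~257").
    """
    if n_rows < min_rows:
        return f"incomplete ({n_rows} rows)"
    return f"{100 * n_pass / n_rows:.1f}%"
```

A cell scored from 36 rows is then shown as incomplete instead of contributing a percentage with a tiny denominator.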
No description provided.